SUMMARY Numerical function generators (NFGs) realize arithmetic
functions, such as $e^x$, $\sin(\pi x)$, and $\sqrt{x}$, in hardware.
They are used in applications where high speed is essential, such as
digital signal or graphics applications. We introduce the edge-valued
binary decision diagram (EVBDD) as a means of reducing the delay and
memory requirements in NFGs. We also introduce a recursive segmentation
algorithm, which divides the domain of the function to be realized into
segments, where the given function is realized as a polynomial. This
design reduces the size of the multiplier needed and thus reduces delay.
It is also shown that an adder can be replaced by a set of 2-input AND
gates, further reducing delay. We compare our results to NFGs designed
with multi-terminal BDDs (MTBDDs). We show that EVBDDs yield a design
that has, on average, only 39% of the memory and 58% of the delay of
NFGs designed using MTBDDs.

key words: Edge-valued binary decision diagrams (EVBDDs), recursive
segmentation, piecewise polynomial approximation, numerical function
generators (NFGs), programmable architecture.

1.  Introduction 

The computation of arithmetic functions, such as trigonometric,
logarithmic, square root, and reciprocal functions, has a long history.
More than 150 years ago, Charles Babbage designed the difference engine
to compute polynomials that could be used to approximate other
functions, such as logarithms [27]. The introduction of electronic
general purpose computers and special languages like FORTRAN improved
the speed and ease with which arithmetic functions could be calculated.
Well into the beginning of the 21st century, the goals remain the same:
to compute functions at high speed and with relative ease for the user.

In this paper, we propose a design of a programmable numerical function
generator (NFG) that computes an arithmetic function in fixed-point
representation. It takes advantage of large quantities of inexpensive,
programmable logic available in modern FPGAs. Because of FPGAs, there
has been recent interest in NFGs that realize polynomials that
approximate the given function [4, 6-8, 18, 25, 26]. In the past,
designs have used uniform segments across the function's domain, where,
in each segment, a (generally different) polynomial is used to realize
the function. By decreasing the segment size, any desired accuracy can
be achieved. Accuracy also depends on the order of the polynomial used
in the approximation. Linear [26] and higher order approximations have
been considered [4, 6-8, 18, 25]. Linear approximation and uniform
segmentation are well suited for some functions like $2^x$,
$\sin(\pi x)$, and $\cos(\pi x)$, but are not appropriate for other
functions like $\sqrt{-\ln(x)}$ and the entropy function
$-(x\log_2(x) + (1-x)\log_2(1-x))$. For such functions, non-uniform
segmentation produces realizations with the desired accuracy [3]. In
this case, segments are chosen to be as wide as possible while still
achieving the specified accuracy. As a result, the segment width is
adapted to the local characteristics of the function: wide segments
where the function is nearly linear and narrow segments where the
function is nonlinear. It follows that this yields the fewest segments
needed to achieve the given accuracy. Since the coefficients of the
approximating polynomial are stored in local memory, non-uniform
segmentation offers a way to reduce the memory requirements of an NFG
realized by a memory-constrained FPGA.

In this paper, we propose a new segmentation algorithm and a new
programmable architecture. Specifically, we propose the edge-valued
binary decision diagram (EVBDD) as a way to design a programmable
circuit that maps a given $X$ into a segment, where the function $f(X)$
is realized by a specific polynomial. We also propose a recursive
segmentation algorithm that produces segments whose widths are chosen
especially to simplify the hardware, while adapting to the degree of
nonlinearity of $f(X)$. This approach is a hybrid of an approach
[10, 11] that uses a special (non-optimum) non-uniform segmentation and
another [22, 24] that uses the optimum non-uniform segmentation.

This paper is organized as follows. The next section introduces
preliminary concepts, including an introduction to the EVBDD. In
Section 3, we discuss the segmentation of the domain, including a new
recursive segmentation method. In Section 4, we discuss the architecture
of our proposed NFG. Experimental results are shown for the realization
of various functions in Section 5. Here, we compare our results to
another programmable architecture that is designed using the
multi-terminal BDD (MTBDD). Finally, in Section 6, we provide concluding
remarks.

2.  Preliminary  Concepts 

2.1  Number  Representation  and  Error

The value of the input variable $X$ and the function value $f(X)$ are
represented in fixed point. Specifically,

Definition 1: $X$ has a fixed-point representation

$X = (x_{l-1}\, x_{l-2} \cdots x_1\, x_0 \,.\, x_{-1}\, x_{-2} \cdots x_{-m})_2$,

where $x_i \in \{0, 1\}$, $l$ is the number of bits in the integer part,
and $m$ is the number of bits in the fractional part. Each bit $x_i$
contributes $2^i x_i$ to the value of $X$, except $x_{l-1}$, which
contributes $-2^{l-1} x_{l-1}$. That is, the fixed-point representation
is in 2's complement.
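The weighting rule of Definition 1 can be checked with a few lines of Python. This is only an illustrative sketch: the function name `fixed_point_value` and the choice of passing the word as a bit string (most significant bit first) are assumptions made for the example, not part of the paper.

```python
def fixed_point_value(bits: str, l: int, m: int) -> float:
    """Value of a 2's-complement fixed-point word (illustrative helper).

    `bits` lists x_{l-1} ... x_0 x_{-1} ... x_{-m}, most significant first.
    Each bit x_i contributes 2^i * x_i, except the top bit x_{l-1},
    which contributes -2^(l-1) * x_{l-1}.
    """
    assert len(bits) == l + m and set(bits) <= {"0", "1"}
    value = 0.0
    for pos, b in enumerate(bits):
        i = l - 1 - pos                  # exponent of this bit's weight
        term = int(b) * 2.0 ** i
        value += -term if i == l - 1 else term
    return value


# l = 2 integer bits, m = 2 fractional bits
print(fixed_point_value("0110", 2, 2))   # 1 + 0.5
print(fixed_point_value("1110", 2, 2))   # -2 + 1 + 0.5
```

With $l = 2$ and $m = 2$, the word 01.10 evaluates to 1.5 and the word 11.10 evaluates to -0.5, since the leading bit carries the negative weight $-2^{1}$.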

Definition 2: Error is the absolute difference between the exact value
and the value produced by the hardware. Acceptable error is the maximum
error that an NFG may assume; it is usually a specification to be
satisfied by the hardware. Approximation error is the error caused by a
function approximation. Acceptable approximation error is the maximum
approximation error that a function approximation may assume. Rounding
error is the error caused by removing certain least significant bits
either by rounding or by truncation.

Definition 3: Precision is the total number of bits for a binary
fixed-point representation. Specifically, n-bit precision specifies that
$n$ bits are used to represent the number; that is, $n = l + m$. We
assume that an n-bit precision NFG has an n-bit input.

Definition 4: Accuracy is the number of bits in the fractional part of a
binary fixed-point representation. Specifically, m-bit accuracy
specifies that $m$ bits are used to represent the fractional part of the
number. When the maximum error is $2^{-m}$, the accuracy can be
expressed as 1 unit in the last place (ULP). In this paper, an m-bit
accuracy NFG is an NFG with an m-bit fractional part of the input, an
m-bit fractional part of the output, and a 1 ULP error.

2.2  Edge-Valued  Binary  Decision  Diagram 

Definition 5: A binary decision diagram (BDD) [2] is a rooted directed
acyclic graph representing a logic function: $\{0,1\}^n \to \{0,1\}$.
The BDD is obtained by repeatedly applying the Shannon expansion to the
logic function. Each function, including the original function and all
sub-functions resulting from applying the Shannon expansion, is
represented by a non-terminal node, unless that function is a trivial
function, 0 or 1, in which case it is represented by a terminal node. A
non-terminal node has two outgoing edges, a 0-edge and a 1-edge, that
correspond to the values of input variables. A terminal node has no
outgoing edges.

Definition 6: A multi-terminal BDD (MTBDD) [5, 19] is an extension of
the BDD, and represents an integer function: $\{0,1\}^n \to Z$, where
$Z$ is a finite set of integers. Specifically, it is a BDD in which the
terminal nodes are not restricted to 0 and 1. Rather, they are labeled
by integer values.

Fig. 1 MTBDD and EVBDD for an integer function. (a) Function table.

Definition 7: An edge-valued BDD (EVBDD) [9, 19] is an extension of the
BDD, and represents an integer function. An EVBDD consists of one
terminal node representing 0 and non-terminal nodes with a weighted
1-edge, where the weight is an integer. Note that, in the EVBDD, 0-edges
have weight 0.

Example 1: Fig. 1(b) and (c) show an MTBDD and an EVBDD for the integer
function $f$ defined by Fig. 1(a). In Fig. 1(b) and (c), dashed lines
and solid lines denote 0-edges and 1-edges, respectively. Note that the
EVBDD has weighted 1-edges. In the MTBDD, terminal nodes represent
function values. Thus, to evaluate the function, we traverse the MTBDD
from the root node to a terminal node according to the input values, and
obtain the function value (an integer) from the terminal node. On the
other hand, in the EVBDD, we obtain the function value by summing the
weights of the edges traversed from the root node to the terminal node.
(End of Example)
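The two evaluation rules of Example 1 can be contrasted in a short Python sketch. The function table of Fig. 1(a) is not reproduced in the text, so the two-variable function below (values 0, 3, 2, 5) is a hypothetical stand-in, and the tuple-based node encoding is likewise an assumption made only for illustration.

```python
# Hypothetical integer function f(x1, x0); NOT the function of Fig. 1.
TABLE = {(0, 0): 0, (0, 1): 3, (1, 0): 2, (1, 1): 5}

# MTBDD: non-terminal = (variable, 0-child, 1-child); terminal = int value.
MTBDD = ("x1", ("x0", 0, 3), ("x0", 2, 5))

# EVBDD: non-terminal = (variable, 0-child, 1-child, weight of 1-edge);
# the single terminal node (value 0) is encoded as None.
# Both edges of the root share one x0 node, so the diagram is smaller.
EV_X0 = ("x0", None, None, 3)
EVBDD = ("x1", EV_X0, EV_X0, 2)


def eval_mtbdd(node, x):
    """Follow edges to a terminal node and return its label."""
    while not isinstance(node, int):
        var, low, high = node
        node = high if x[var] else low
    return node


def eval_evbdd(node, x):
    """Sum the weights of the traversed 1-edges (0-edges weigh 0)."""
    total = 0
    while node is not None:
        var, low, high, w1 = node
        if x[var]:
            total += w1
            node = high
        else:
            node = low
    return total


for (x1, x0), value in TABLE.items():
    x = {"x1": x1, "x0": x0}
    print(x1, x0, value, eval_mtbdd(MTBDD, x), eval_evbdd(EVBDD, x))
```

For this stand-in function the MTBDD needs three non-terminal nodes and four terminal nodes, while the EVBDD needs two non-terminal nodes and one terminal node, at the cost of an addition along the path.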

3.  Piecewise  Polynomial  Approximation  Based  on  Non-uniform  Segmentation

3.1  Uniform  and  Non-uniform  Segmentations

Fig. 2 Uniform and non-uniform segmentations of arcsin(X).
(a) Uniform segmentation. (b) Non-uniform segmentation.

The realization of a function $f(X)$ in hardware is done by dividing the
domain of the function into segments. We choose non-uniform segments,
which means that each segment width is chosen so that the given
acceptable approximation error $\varepsilon_a$ is just met. Therefore,
in a linear approximation, a wide segment occurs where the function is
close to linear, and a narrow segment occurs where the function is
highly nonlinear. In each case, the maximum error in the segment is
$\varepsilon_a$. Finding an optimum segmentation is an important part of
the design process. Within each segment, a polynomial is used to
approximate $f(X)$ in that segment. If the segment is sufficiently
small, the polynomial will approximate $f(X)$ to the desired accuracy.
For example, if the segment is very small, even a linear approximation
will be sufficiently accurate. However, when the segment size is small,
too many segments may be required, and there may not be enough memory to
store all the required coefficients. Thus, for memory-constrained
implementations (e.g., FPGAs), it is important to reduce the number of
segments while achieving the desired accuracy. There are two methods to
reduce the number of segments.

One uses a higher order polynomial to approximate the function. In
general, a higher order polynomial results in larger segments, and so
reduces the number of segments. However, for certain functions like
$\sqrt{-\ln(X)}$ and the entropy function, just using a higher order
polynomial cannot reduce the number of segments effectively. Most
existing methods [4, 6-8, 18, 25, 26] use uniform segmentation, which
partitions the domain into segments of the same size. In such a
segmentation, the most significant bits of $X$ are used to specify a
segment, and the least significant bits determine a point within that
segment. The size of all segments is the same as the smallest segment
size needed to achieve the desired accuracy. Therefore, depending on the
function, uniform segmentation can yield too many segments even if a
higher order polynomial is used [15, 17].

To reduce the number of segments for such functions, there is another
method, non-uniform segmentation. In this method, segments are chosen to
be as wide as possible while still achieving the desired accuracy. Such
an optimum non-uniform segmentation yields the fewest segments for the
given function, and so reduces the memory size needed to store all the
coefficients [15, 22, 24].

Example 2: Fig. 2 shows uniform and non-uniform segmentations of
arcsin(X), where $X$ has 6-bit accuracy, the function is approximated by
quadratic polynomials, and the acceptable approximation error is
$2^{-8}$. The number of uniform segments is 32, while the number of
non-uniform segments is only 4. (End of Example)

Input: Numerical function $f(X)$, domain $[A, B)$ for $X$, accuracy
  $m_{in}$ of $X$, polynomial order $d$, and acceptable approximation
  error $\varepsilon_a$.
Output: Segments $[A, P_0), [P_0, P_1), \ldots, [P_{t-2}, B)$.
Step:

1. For $[A, B)$, compute the maximum approximation error
   $\varepsilon_d(A, B)$.

2. If $\varepsilon_d(A, B) < \varepsilon_a$ or $B - A \le 2^{-m_{in}}$,
   then stop.

3. Else, partition $[A, B)$ into two segments $[A, P)$ and $[P, B)$,
   where $P = (A + B)/2$.

4. Repeat Steps 1, 2, and 3 for each new segment recursively, until the
   maximum approximation errors are smaller than $\varepsilon_a$ in all
   segments.

Fig. 3 Recursive segmentation algorithm for the domain.

3.2  Recursive  Segmentation  Algorithm 

Although non-uniform segmentation yields fewer segments than uniform
segmentation, non-uniform segmentation requires an additional circuit
that maps a given $X$ into a segment. Lee et al. [11] have proposed a
special non-uniform segmentation, hierarchical segmentation, to simplify
the additional circuit. However, since their method has only four
segmentation types that simplify the additional circuit, the generated
segmentation does not always adapt to the degree of nonlinearity of the
given function. In this section, to reduce both hardware complexity and
the number of segments, we present a new non-uniform segmentation
method, recursive segmentation, that is a hybrid of the method [11] and
our previous method [15, 22, 24].

Fig. 3 shows the recursive segmentation algorithm. The inputs for this
algorithm are a numerical function $f(X)$, a domain $[A, B)$ for $X$, an
accuracy $m_{in}$ of $X$, a polynomial order $d$, and an acceptable
approximation error $\varepsilon_a$. Then, this algorithm produces $t$
segments $[A, P_0), [P_0, P_1), \ldots, [P_{t-2}, B)$ by recursively
partitioning a segment into two equal-sized segments until achieving the
acceptable approximation error $\varepsilon_a$ in all segments. Note
that this algorithm restricts the width $w_i$ of each segment to
$w_i = 2^{h_i} \times 2^{-m_{in}}$, where $h_i$ is an integer. That is,
the segmentation points $P_i$ are restricted to values of which the
least significant bits are 0 (i.e.,
$P_i = (\ldots\, p_{-j+1}\, p_{-j}\, 0\, 0 \ldots 0)_2$, where
$j = m_{in} - h_i$). As shown in Fig. 3, the number of segments depends
on the maximum approximation error $\varepsilon_d(A, B)$. In this paper,
we use the Chebyshev approximation polynomials. For a segment $[S, E]$
of $f(X)$, the maximum approximation error of the $d$th-order Chebyshev
approximation $\varepsilon_d(S, E)$ is given by [12]:

$$\varepsilon_d(S, E) = \frac{2 (E - S)^{d+1}}{4^{d+1} (d+1)!}
  \max_{S \le X \le E} \left| f^{(d+1)}(X) \right|,$$

where $f^{(d+1)}$ is the $(d+1)$th-order derivative of $f$.
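The recursion of Fig. 3 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It takes the Chebyshev error bound to be $2(E-S)^{d+1} / (4^{d+1}(d+1)!) \cdot \max|f^{(d+1)}|$, and it assumes the caller supplies a function `max_abs_deriv(S, E)` (a name invented here) that returns the maximum of $|f^{(d+1)}|$ on the segment; how that maximum is obtained for a general function is outside this sketch.

```python
import math


def cheb_error_bound(S, E, d, max_abs_deriv):
    """2 (E-S)^(d+1) / (4^(d+1) (d+1)!) * max |f^(d+1)| on [S, E]."""
    width = E - S
    return (2.0 * width ** (d + 1)
            / (4.0 ** (d + 1) * math.factorial(d + 1))
            * max_abs_deriv(S, E))


def recursive_segmentation(A, B, m_in, d, eps_a, max_abs_deriv):
    """Halve [A, B) until every segment meets eps_a (Steps 1-4 of Fig. 3)."""
    # Step 2: stop when the error is acceptable or the width is one LSB.
    if (cheb_error_bound(A, B, d, max_abs_deriv) < eps_a
            or B - A <= 2.0 ** (-m_in)):
        return [(A, B)]
    # Step 3: split at the midpoint, then recurse on both halves (Step 4).
    P = (A + B) / 2.0
    return (recursive_segmentation(A, P, m_in, d, eps_a, max_abs_deriv)
            + recursive_segmentation(P, B, m_in, d, eps_a, max_abs_deriv))


# e^X on [0, 1), 2nd-order polynomials, 23-bit accuracy, eps_a = 2^-25.
# The third derivative of e^X is e^X, which is largest at the right end.
segments = recursive_segmentation(
    0.0, 1.0, 23, 2, 2.0 ** -25, lambda S, E: math.exp(E))
print(len(segments))
```

Under these assumptions the sketch produces 103 segments for $e^X$ (25 of width $2^{-6}$ followed by 78 of width $2^{-7}$), which agrees with the "No. of segs 1" entry for that function in Table 1. Because every split is at a midpoint, all segment widths are powers of two.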

This algorithm can be applied to any given domain $[A, B)$. However, a
wide domain necessarily requires a large number of segments. In this
case, we can reduce the given domain to a narrower domain by using a
range reduction technique [1, 13], as is done with existing methods
based on uniform segmentation.


3.3  Computation  of  the  Approximate  Value 

For each segment, $f(X)$ is approximated by the corresponding polynomial
function $g(X, i)$. That is, the approximated value of $f(X)$ is
computed by
$g(X, i) = C_d(i) X^d + C_{d-1}(i) X^{d-1} + \ldots + C_0(i)$, where $i$
is a segment index assigned to each segment, and the coefficients
$C_d(i), C_{d-1}(i), \ldots, C_0(i)$ are derived from the $d$th-order
Chebyshev approximation polynomial [12].

For each segment $[S_i, E_i)$, since $S_i \le X < E_i$ holds, we can
offset $X$ by $S_i$ to compute the polynomial $g(X, i)$. By using the
offset input $(X - S_i)$ instead of $X$, we reduce the size of
multipliers needed to compute $g(X, i)$. By substituting
$X - S_i + S_i$ for $X$, we transform $g(X, i)$ as follows:

$$\begin{aligned}
g(X, i) &= C_d(i) X^d + C_{d-1}(i) X^{d-1} + \ldots + C_0(i) \\
        &= C_d(i) (X - S_i + S_i)^d
           + C_{d-1}(i) (X - S_i + S_i)^{d-1} + \ldots + C_0(i) \\
        &= C_d(i) (X - S_i)^d
           + \{C_{d-1}(i) + d\, C_d(i) S_i\} (X - S_i)^{d-1} + \ldots \\
        &\qquad \ldots + C_0(i).
\end{aligned}$$

Let the coefficients of $(X - S_i)^j$ ($j = 0, 1, \ldots, d-1$) be

$$C'_j(i) = \sum_{k=0}^{d-j} \binom{j+k}{j} C_{j+k}(i)\, S_i^{k}
  \quad (j = 0, 1, \ldots, d-1).$$

Then, we have

$$g(X, i) = C_d(i) (X - S_i)^d + C'_{d-1}(i) (X - S_i)^{d-1} + \ldots
  + C'_0(i). \qquad (1)$$

This transformation reduces the multiplier size (see Section 4.3). That
is, instead of using the entire value $X$, as in the approximation
$C_d(i) X^d + C_{d-1}(i) X^{d-1} + \ldots + C_0(i)$, which requires the
maximum number of bits to represent $X$, we use (1) to approximate the
function, where typically a smaller number of bits is needed to realize
$X - S_i$.
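The change of variable behind (1) can be verified numerically. The Python sketch below computes the shifted coefficients with the binomial sum $C'_j = \sum_{k=0}^{d-j} \binom{j+k}{j} C_{j+k} S^k$ and checks that evaluating in $X - S$ gives the same value as evaluating in $X$. The helper names and the sample coefficients are assumptions chosen only for the demonstration; exact rational arithmetic is used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction
from math import comb


def shift_coefficients(C, S):
    """C[n] multiplies X^n; the result C2[j] multiplies (X - S)^j."""
    d = len(C) - 1
    return [sum(comb(j + k, j) * C[j + k] * S ** k
                for k in range(d - j + 1))
            for j in range(d + 1)]


def horner(coeffs, t):
    """Evaluate sum(coeffs[n] * t^n) with Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * t + c
    return acc


# Hypothetical 2nd-order coefficients C_0, C_1, C_2 and segment start S.
C = [Fraction(1), Fraction(-2), Fraction(3)]
S = Fraction(1, 4)
C_shifted = shift_coefficients(C, S)

X = Fraction(5, 16)
print(horner(C, X), horner(C_shifted, X - S))
```

The leading coefficient is unchanged by the shift, and the coefficient of $(X - S)^{d-1}$ equals $C_{d-1} + d\,C_d S$, matching the third line of the derivation.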


(a) Architecture of NFG (input $X$, output $f(X)$).

(b) Computation of $X - S_i$ using AND gates:

    $X               = (x_{l-1}\, x_{l-2} \cdots x_{-j}\; x_{-j-1} \cdots x_{-m_{in}})_2$
    $S_i             = (s_{l-1}\, s_{l-2} \cdots s_{-j}\; 0 \cdots 0)_2$
    $X - S_i         = (0\; 0 \cdots 0\; x_{-j-1} \cdots x_{-m_{in}})_2$

    $X               = (x_{l-1}\, x_{l-2} \cdots x_{-j}\; x_{-j-1} \cdots x_{-m_{in}})_2$
    $\bar{S_i}       = (\bar{s}_{l-1}\, \bar{s}_{l-2} \cdots \bar{s}_{-j}\; 1 \cdots 1)_2$
    $X \& \bar{S_i}  = (0\; 0 \cdots 0\; x_{-j-1} \cdots x_{-m_{in}})_2$

    Note: 1. $x_{l-1} = s_{l-1}$, $x_{l-2} = s_{l-2}$, ...,
    $x_{-j} = s_{-j}$, 2. $j = m_{in} - h_i$, and 3. $\&$ is
    bit-by-bit AND.

Fig. 4 Architecture of the NFG based on 2nd-order polynomials.


4.  Architecture  of  the  NFG 

Fig. 4 shows the architecture of the NFG based on a 2nd-order
polynomial. As shown in Fig. 4(a), polynomials of the form (1) are
realized using a segment index encoder (SIE), a coefficient memory,
circuits for $(X - S_i)^k$ ($k = d, d-1, \ldots, 2$), multipliers, and
adders. Since modern FPGAs have logic elements, synchronous memory
blocks, and dedicated multipliers, this architecture is efficiently
implemented by those hardware resources in an FPGA. This architecture
can realize any non-uniform segmentation. However, when recursive
segmentation is used, we can realize $X - S_i$ using 2-input AND gates
instead of an adder. As mentioned in the previous section, the least
significant $h_i$ bits of $S_i$ are 0, and
$X - S_i < 2^{h_i} \times 2^{-m_{in}}$ (i.e., $x_{l-1} = s_{l-1}$,
$x_{l-2} = s_{l-2}$, ..., $x_{-j} = s_{-j}$ and
$s_{-j-1} = s_{-j-2} = \ldots = s_{-m_{in}} = 0$ because of the way
$S_i$ is chosen). Therefore, $X - S_i$ has 1's only in the least
significant $h_i$ bits, and these 1's occur in exactly the same
positions as the 1's in $X$. Thus, as shown in Fig. 4(b), we realize
$X - S_i$ using AND gates driven on one side by $\bar{S_i}$, the
complement of $S_i$. The SIE converts $X$ into a segment index $i$. It
realizes the segment index function
seg-func(X): $\{0,1\}^n \to \{0, 1, \ldots, t-1\}$ shown in Fig. 5(a),
where $X$ has $n$ bits, and $t$ denotes the number of segments.
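The replacement of the subtractor by AND gates can be checked exhaustively on small words. In the Python sketch below, fixed-point words are modelled as unsigned integers (that is, scaled by $2^{m_{in}}$); this encoding, the 8-bit word length, and the function name are assumptions made for the demonstration.

```python
def offset_by_and(x: int, s: int, n: int) -> int:
    """Bitwise X & (complement of S_i), restricted to an n-bit word."""
    mask = (1 << n) - 1
    return x & (~s & mask)


# Exhaustive check for n = 8: every segment start s whose low h bits are 0
# and whose segment has width 2^h, as produced by recursive segmentation.
n = 8
checked = 0
for h in range(n + 1):
    for s in range(0, 1 << n, 1 << h):
        for x in range(s, s + (1 << h)):
            assert offset_by_and(x, s, n) == x - s
            checked += 1
print(checked, "cases agree with ordinary subtraction")
```

The identity relies on both properties stated above: the upper bits of $X$ equal those of $S_i$ (so they are cleared by the complement), and the lower $h_i$ bits of $S_i$ are 0 (so the lower bits of $X$ pass through unchanged). It does not hold for arbitrary segment boundaries.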

4.1  Architecture  of  the  SIE

Fig. 5(b) shows an LUT cascade [22, 23] that realizes seg-func(X). The
LUT cascade is obtained by functional decomposition using an MTBDD for
seg-func(X) [20, 21], and can realize any seg-func(X), where the size of
the LUT cascade depends on the number of segments. Reference [15] has
shown that this size can be reduced by reducing the number of segments.
This section presents a new programmable architecture for the SIE that
reduces the size and delay time. Fig. 5(c) shows the new architecture.
To realize seg-func(X) using the SIE in Fig. 5(c), we represent
seg-func(X) using an EVBDD. Then, by decomposing the EVBDD, we obtain
the SIE that consists of an LUT cascade and adders. In an LUT cascade,
the interconnecting lines between adjacent LUT memories are called
rails. In this case, the rails represent sub-functions in the EVBDD. The
outputs from each LUT memory other than rails represent the sum of
weights of edges. In this paper, we call such outputs Arails (adder
rails). To the best of our knowledge, this is the first design method
using an EVBDD to produce the cascaded programmable architecture.

    Segments                    Index
    $A \le X < P_0$             0
    $P_0 \le X < P_1$           1
    ...                         ...
    $P_{t-2} \le X < B$         $t-1$

    (a) Segment index function.

    (b) LUT cascade (MT_SIE).  (c) LUT cascade and adders (EV_SIE).

Fig. 5 Segment index encoders.

Example 3: By decomposing the MTBDD and EVBDD in Fig. 1, we obtain the
SIEs in Fig. 6. Fig. 6(a) and (b) illustrate the correspondences between
each LUT memory and decompositions of the MTBDD and the EVBDD,
respectively. In these figures, the column labeled '$r_i$' in the table
of each LUT denotes the rails that represent sub-functions in BDDs. The
column '$a_i$' in Fig. 6(b) denotes the Arails that represent the sum of
weights of edges. In the MTBDD, numbers assigned to edges that cut
across the horizontal lines represent sub-functions. In the EVBDD, the
pairs of numbers assigned to edges that cut across the horizontal lines
represent the sum of weights and sub-functions, respectively. The SIE in
Fig. 6(a) requires $2^2 \times 2 + 2^3 \times 3 + 2^4 \times 3 = 80$
bits and 3 levels (3 LUT memories). On the other hand, the SIE in
Fig. 6(b) requires $2^2 \times 4 + 2^2 \times 2 + 2^2 \times 1 = 28$
bits and 4 levels (3 LUT memories + 1 adder). (End of Example)
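The memory figures in Example 3 follow from the usual size of a lookup table: a memory with $k$ address inputs and $w$ output bits holds $2^k \times w$ bits. The Python sketch below recomputes both totals. The (inputs, outputs) pairs are read off the two sums quoted in the example; the helper name is an assumption.

```python
def lut_cascade_bits(luts):
    """Total memory of a cascade; each LUT is (num_inputs, num_outputs)."""
    return sum((1 << k) * w for k, w in luts)


# (inputs, outputs) per LUT memory, taken from the sums in Example 3.
mt_sie = [(2, 2), (3, 3), (4, 3)]   # 2^2 x 2 + 2^3 x 3 + 2^4 x 3
ev_sie = [(2, 4), (2, 2), (2, 1)]   # 2^2 x 4 + 2^2 x 2 + 2^2 x 1

print(lut_cascade_bits(mt_sie), lut_cascade_bits(ev_sie))
```

In this small example every LUT memory of the EVBDD-based cascade has only two inputs, so its cells grow only with the output width, whereas the input counts of the MTBDD-based cascade grow along the chain.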

This paper uses two terms: MT_SIE and EV_SIE denote the SIEs designed
using an MTBDD (Fig. 5(b)) and an EVBDD (Fig. 5(c)), respectively. Both
the MT_SIE and the EV_SIE can realize any non-uniform segmentation. In
both cases, the size of LUT memories depends on the number of segments.
Specifically,

Theorem 1: Let seg-func(X) be a segment index function with $t$
segments. Then, there exists an EV_SIE for seg-func(X) with at most
$\lceil \log_2 t \rceil$ rails and $\lceil \log_2 t \rceil$ Arails.

Proof: See Appendix.

The size of LUT memories and the number of levels of an EV_SIE depend on
the decomposition of an EVBDD. To obtain the optimum decomposition, we
can use optimization algorithms for heterogeneous multi-valued decision
diagrams (MDDs) [14].

Fig. 6 Example of SIEs. (a) SIE using MTBDD (MT_SIE). (b) SIE using
EVBDD (EV_SIE).

Fig. 7 LUT cascade implemented by embedded RAMs.

4.2  Programmability  of  the  SIE 

The LUT memories of the SIEs shown in Fig. 5(b) and (c) are implemented
by embedded RAMs (e.g., M4Ks) in an FPGA. Thus, by changing the data for
the LUT memories and the coefficient memory, a wide class of numerical
functions can be realized by a single architecture. Since just changing
the RAM data can switch numerical functions, we can switch functions
even while the FPGA is running.

Fig. 7 and Fig. 8 show the details of the LUT cascade and the control
circuit for changing the RAM data, respectively. In these figures,
'mode' denotes a signal to switch between the operation mode and the
program mode of the LUT cascade. The control circuit consists of a
counter and a decoder, and generates the address and write enable
signals for each RAM sequentially.

To the best of our knowledge, a programmable architecture for the SIE
has never before been proposed.

4.3  Reduction  of  the  Size  of  the  Multiplier 

Since large multipliers have large delay, it is important to reduce
multiplier size. We do this in two ways: we reduce the number of bits
needed to represent (1) the coefficients and (2) the variables
$(X - S_i)$.

To reduce the number of bits in the coefficients, we use a scaling
method [10]. We first shift the coefficients right. Then, we apply
rounding. Then, we do the actual multiplication. Finally, we shift the
product left to compensate for the original right shift of the
coefficients. This process is similar to floating point multiplication.
A side effect is that rounding error is increased, since rounding occurs
on a smaller value. In applying this method, we choose the largest
exponent (right shift) that produces an error no greater than the given
acceptable error [15]. If this yields an exponent of 0 (no right shift)
in all segments, then we do not use the scaling method.
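The four steps of the scaling method (shift right, round, multiply, shift left) can be sketched on integers that stand for fixed-point words. The sketch below is an illustration under stated assumptions: round-to-nearest is assumed for the rounding step, the function names are invented here, and the rule used in [15] to pick the exponent is not reproduced; only the worst-case error for a given exponent is shown.

```python
def scaled_multiply(coeff: int, x: int, e: int) -> int:
    """Multiply with a coefficient shortened by e bits.

    coeff and x are integers standing for fixed-point words.
    """
    if e == 0:
        return coeff * x
    c_short = (coeff + (1 << (e - 1))) >> e   # shift right, round to nearest
    return (c_short * x) << e                 # multiply, then shift back left


def worst_case_error(x_max: int, e: int) -> int:
    """Upper bound on |scaled product - exact product| for 0 <= x <= x_max."""
    return 0 if e == 0 else (1 << (e - 1)) * x_max


coeff, x = 1000, 13
for e in range(5):
    approx = scaled_multiply(coeff, x, e)
    print(e, approx, abs(approx - coeff * x), worst_case_error(x, e))
```

The bound grows with the exponent, which is the side effect noted above: a larger right shift gives a narrower multiplier input but a larger rounding error, so the exponent has to be limited by the acceptable error.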

To reduce the value of the variable $X - S_i$, we make the following
observation. In each segment $[S_i, E_i)$, we have
$X - S_i < E_i - S_i$. Thus, reducing the segment width $E_i - S_i$
reduces $X - S_i$ for $X$ near $E_i$. However, this also increases the
number of segments, and thus the size of the coefficient memory. We show
a segment reduction technique that does not increase the coefficient
memory size.

In an FPGA implementation, the coefficient memory in Fig. 4 has $2^u$
words, where $u = \lceil \log_2 t \rceil$ and $t$ is the number of
segments. Therefore, we can increase the number of segments up to
$t = 2^u$ without increasing the coefficient memory size. From
Theorem 1, the size of the EV_SIE also depends on the value of $u$.
Increasing the number of segments to $t = 2^u$ rarely increases the size
of the EV_SIE. We reduce the size of segments by dividing the largest
segment into two equal-sized segments up to $t = 2^u$.
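The segment-splitting step can be sketched in Python as follows. The function name, the representation of segments as (start, end) pairs, and the tie-breaking rule (the leftmost of several equally wide segments is split first) are assumptions made for the illustration; the text only states that the largest segment is halved until $t = 2^u$.

```python
def pad_to_power_of_two(segments):
    """Halve the widest segment until the count is 2^u, u = ceil(log2 t)."""
    segs = sorted(segments)
    target = 1 << max(len(segs) - 1, 0).bit_length()   # 2^ceil(log2 t)
    while len(segs) < target:
        # Index of the widest segment (leftmost one on ties).
        k = max(range(len(segs)), key=lambda i: segs[i][1] - segs[i][0])
        s, e = segs[k]
        mid = (s + e) / 2
        segs[k:k + 1] = [(s, mid), (mid, e)]
    return segs


# A 103-segment example: 25 segments of width 1/64, then 78 of width 1/128.
start = 25 / 64
segs = ([(k / 64, (k + 1) / 64) for k in range(25)]
        + [(start + k / 128, start + (k + 1) / 128) for k in range(78)])
padded = pad_to_power_of_two(segs)
print(len(segs), "->", len(padded))
```

For this 103-segment input, $u = 7$, so 25 splits are available, exactly enough to halve every segment of width 1/64. The result is 128 segments of equal width, that is, a uniform segmentation, which is consistent with the entries marked with an asterisk in Table 1.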

5.  Experimental  Results 

5.1  Number  of  Segments  and  Computation  Time 

Table 1 compares the number of segments for various segmentation methods
based on a 2nd-order Chebyshev approximation. In Table 1, "No. of
uniform segs" shows the number of uniform segments, "No. of nonuni.
segs" shows the number of non-uniform segments produced by [15], and
"Recursive" denotes the recursive segmentation method shown in this
paper. In the column "Recursive", the sub-column "No. of segs 1" shows
the number of segments produced by the segmentation algorithm shown in
Section 3. The sub-column "No. of segs 2" shows the number of segments
produced by additionally applying the reduction method of multiplier
size shown in Section 4. The sub-column "Time" shows the total CPU time,
in milliseconds, for both the segmentation algorithm and the reduction
method of multiplier size.

Table 1 Number of segments for various segmentation methods.

X has 23-bit accuracy.
Acceptable approximation error: $2^{-25}$

                                No. of      No. of     Recursive
Function            Domain      uniform     nonuni.    No. of    No. of    Time
f(X)                [A, B)      segs        segs       segs 1    segs 2    [msec.]
$e^X$               [0,1)             128         67       103      128*        10
$\sin(\pi X)$       [0,0.5)           128         74       112      128*        10
$\tan(\pi X)$       [0,0.5)     4,194,304      4,594     5,723     8,192     1,600
$\arcsin(X)$        [0,1)       8,388,608        256       363       512        70
$\sqrt{X}$          (0,1)       8,388,607        228       322       512        30
$\sqrt{-\ln(X)}$    (0,1)       8,388,607        698       967     1,024       190
$X\ln(X)$           (0,1)       2,097,152        172       250       256        10

*Uniform segmentation is produced.

Environment: Sun Blade 2500 (Silver), UltraSPARC-IIIi 1.6GHz,
6GB memory, Solaris 9.

Table 1 shows that uniform segmentation requires excessively many
segments to approximate certain functions, such as $\tan(\pi X)$.
Existing methods based on uniform segmentation cannot implement those
functions in conventional FPGAs because the required coefficient memory
is too large. Actually, many existing methods have not realized
$\tan(\pi X)$ in the domain [0,0.5). This is because $\tan(\pi X)$ in
[0,0.5) can be computed by $\sin(\pi X)/\cos(\pi X)$ or a combination of
$\tan(\pi X)$ in [0,0.25] and $1/\tan(\pi X')$, where $X' = 0.5 - X$,
and those functions can be implemented by the existing methods. However,
these require multiple NFGs that realize elementary functions, such as
sin, cos, or the reciprocal function. On the other hand, the non-uniform
or recursive segmentation described here can compactly realize
$\tan(\pi X)$ with a single NFG, since non-uniform and recursive
segmentation methods require many fewer segments. For all functions in
Table 1, the non-uniform segmentation method [15] requires the fewest
segments among the three segmentation methods. Although our recursive
segmentation algorithm restricts the segmentation points, it requires at
most about 2.2 times as many segments as non-uniform segmentation [15].
That is, our recursive segmentation algorithm generates a segmentation
appropriate to the given function, while restricting the segmentation
points. Thus, our recursive segmentation produces a small coefficient
memory for the given function. In the next section, we show that
recursive segmentation reduces the size of the SIE as well.
Table 2 FPGA implementation of SIEs.

FPGA device: Altera Stratix EP1S10F484C5 (LE: 10,570, M4K: 60, M512: 90)
Logic synthesis tool: Altera QuartusII 5.0 (speed optimization, timing requirement of 200MHz)

Optimum non-uniform segmentation

                    MT_SIE                                EV_SIE
Function            LUT size    LE   Level  Delay         LUT size    LE   Level  Delay
f(X)                [bits]                  [nsec.]       [bits]                  [nsec.]
$e^X$                  26,368   115      8    58.8           23,040   154      7    43.5
$\sin(\pi X)$          26,880   115      8    54.9           23,552   151      7    45.1
$\tan(\pi X)$       1,802,240     -      5       -          179,968   340     10    75.3
$\arcsin(X)$           61,440   123      8    53.2           53,824   336     13    87.6
$\sqrt{X}$             61,440   123      8    53.2           57,408   289     11    75.5
$\sqrt{-\ln(X)}$      266,240   138      7    56.7          116,160   330     11    81.4
$X\ln(X)$              61,440   129      8    52.5           48,400   250     10    63.8

Recursive segmentation

                    MT_SIE                                EV_SIE
Function            LUT size    LE   Level  Delay         LUT size    LE   Level  Delay
f(X)                [bits]                  [nsec.]       [bits]                  [nsec.]
$e^X$                       0     0      0       0                0     0      0       0
$\sin(\pi X)$               0     0      0       0                0     0      0       0
$\tan(\pi X)$       1,687,552     -      5       -           15,108   201      7    43.6
$\arcsin(X)$           49,152   109      7    46.7            9,984   107      5    28.0
$\sqrt{X}$             44,544   107      7    46.4           10,752   112      5    27.2
$\sqrt{-\ln(X)}$      172,032   118      7    54.3           12,736   148      6    38.4
$X\ln(X)$              20,992    67      5    30.8            8,960    74      4    22.9

-: It cannot be mapped into the FPGA due to insufficient RAM blocks.


Table 3  FPGA implementation of 23-bit precision (23-bit accuracy) NFGs.

FPGA device: Altera Stratix EP1S60F1020C5 (LE: 57,120, DSP: 144, M4K: 292, M512: 574)
Logic synthesis tool: Altera QuartusII 5.0 (speed optimization, timing requirement of 200 MHz)

Function      MTNFG based on optimum non-uniform      EVNFG based on recursive
f(x)            Memory     LE  DSP  Level    Delay      Memory     LE  DSP  Level    Delay
                [bits]                     [nsec.]      [bits]                     [nsec.]
e^x             39,040    689   10     13     99.6       8,064    432   10      3     25.1
sin(πX)         36,864    635   10     13     99.1       7,936    395   10      3     28.3
tan(πX)      2,867,200      -   16     11        -     973,572  1,059   16     12     92.3
arcsin(X)       84,736  1,301   16     14    107.3      53,504    937   16     10     80.3
√X              83,712  1,041   16     14    116.5      53,760    917   16     10     77.2
√(-ln(X))      357,376    950   16     13     99.8     103,872    972   16     11     88.3
Xln(X)          83,200    988   16     14    116.0      31,744    989   16      9     70.4

-: It cannot be mapped into the FPGA due to insufficient RAM blocks.


5.2 FPGA Implementation of SIEs

Table 2 compares the FPGA implementation results of the MT_SIE and EV_SIE. In
this table, “LUT size” shows the total size of LUT memories used in the SIE,
in bits. Note that the LUT memory size and the number of logic elements (LE)
are 0 for e^x and sin(πX) when recursive segmentation is used. This is because
our algorithm generated uniform segments, as shown in Table 1, and so an SIE
was not needed. In the experiment that produced the data in Table 2, we
optimized the decomposition of the MTBDDs and EVBDDs by requiring the size of
each LUT memory used in these SIEs to be 4K bits, the same as the RAM block
(M4K) of the FPGA.

Table 2 shows that, for optimum non-uniform segmentation, the EV_SIEs have a
smaller LUT memory size than the MT_SIEs. For example, for tan(πX), the LUT
memory size of the EV_SIE is only 10% of the size needed by the MT_SIE. For
tan(πX), the LUT memory size of the MT_SIE is quite large because the number
of non-uniform segments is large. From experiments with uniform segmentation,
we know that for tan(πX), the total memory size needed by the NFG using the
MT_SIE is only 1.5% of the total memory size needed by the NFG based on
uniform segmentation (i.e., the existing methods). However, this is still too
large to implement with the FPGA. By using the EV_SIE, we can reduce the LUT
memory size significantly, and make the NFG implementable with the FPGA.

Our recursive segmentation can reduce both the LUT memory size and the delay
time of the MT_SIEs. In particular, for Xln(X), an MT_SIE designed for
recursive segmentation has only 34% of the LUT memory size and 59% of the
delay of the MT_SIE designed for optimum non-uniform segmentation.

By using recursive segmentation and the EV_SIE, we can reduce both the LUT
memory size and the delay time of SIEs significantly. For all functions in
Table 2, both the LUT memory size and the delay time of the EV_SIEs for
recursive segmentation are much smaller than those of the MT_SIEs. In terms of
the number of LEs, the EV_SIEs for recursive segmentation require only up to
1.3 times more LEs than the MT_SIEs. Therefore, designing an EV_SIE for
recursive segmentation yields faster and more compact SIEs than those obtained
by previous methods. The design is formal and is easily programmed.

5.3 FPGA Implementation of NFGs

Table 3 compares the FPGA implementation results of our NFGs using EV_SIE
(EVNFGs) with the existing NFGs using MT_SIE (MTNFGs) [15], where EVNFGs are
based on recursive segmentation and MTNFGs are based on the optimum
non-uniform segmentation. Both NFGs have 23-bit precision (23-bit accuracy).

Table 4  FPGA implementation of 24-bit precision NFGs.

FPGA device: Xilinx Virtex-II XC2V4000-6
Logic synthesis tool: Synplify Premier Ver. 8.5

Function      NFG in [11]                               EVNFG
f(x)            Memory  Slice  Mult.  Level    Delay      Memory  Slice  Mult.  Level    Delay
                [bits]                       [nsec.]      [bits]                       [nsec.]
Xln(X)          40,446    871     10     14    103.7      34,560    454      5      9     64.7
humps               NA    409      4     13     82.8      91,648    189      4      8     55.9

Table 5  Comparison of design methods.

        Segmentation methods
SIEs    Non-uniform                       Recursive
MT      • Smaller coefficient memory      • Larger coefficient memory
        • Largest and slower SIE          • Smaller and faster SIE
EV      • Smaller coefficient memory      • Larger coefficient memory
        • Larger and slowest SIE          • Smallest and fastest SIE

From Table 2 and Table 3, we can see that the LUT memory size of the MT_SIE
accounts for more than 2/3 of the total memory size of the MTNFG. On the other
hand, by using recursive segmentation and the EV_SIE, the LUT memory size
needed for the SIE can be reduced to less than 1/4 of the total memory size of
the EVNFG. As a result, the EVNFGs require only 21% to 64% of the memory size
needed for the MTNFGs. For arcsin(X) and √X, as shown in Table 1, our
recursive segmentation requires a coefficient memory that is about twice as
large as that needed for the optimum non-uniform segmentation. Nevertheless,
by using EV_SIEs, the total memory sizes of EVNFGs can be reduced to about 60%
of the memory sizes of MTNFGs. Further, Table 3 shows that EVNFGs require
fewer LEs and levels (i.e., shorter latency) than MTNFGs, and the delay time
of EVNFGs is only about 24% to 94% of the delay time of MTNFGs.
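
The memory ratios quoted above can be recomputed from the Memory columns of Table 3. The short sketch below does so; the dictionary keys are illustrative ASCII names, and the label of the fifth row (taken here to be the square root function) is an assumption based on the function order used in Table 2.

```python
# Total memory in bits from Table 3: (MTNFG, EVNFG) per function.
# Keys are illustrative names; the "sqrt(X)" row label is assumed.
memory_bits = {
    "e^x": (39_040, 8_064),
    "sin(pi X)": (36_864, 7_936),
    "tan(pi X)": (2_867_200, 973_572),
    "arcsin(X)": (84_736, 53_504),
    "sqrt(X)": (83_712, 53_760),
    "sqrt(-ln(X))": (357_376, 103_872),
    "X ln(X)": (83_200, 31_744),
}

# Ratio of EVNFG memory to MTNFG memory for each function.
ratios = {name: ev / mt for name, (mt, ev) in memory_bits.items()}
mean_ratio = sum(ratios.values()) / len(ratios)

for name, r in ratios.items():
    print(f"{name:<14}{100 * r:5.1f}%")
print(
    f"min {100 * min(ratios.values()):.0f}%, "
    f"max {100 * max(ratios.values()):.0f}%, "
    f"mean {100 * mean_ratio:.0f}%"
)
```

With these entries the smallest and largest ratios round to 21% and 64%, and the unweighted mean over the seven functions rounds to 39%.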

To compare our NFG with the existing NFG based on another non-uniform
segmentation method (hierarchical segmentation) shown in [11], we implemented
our 24-bit precision NFGs for Xln(X) and the “humps” function using the Xilinx
Virtex-II FPGA (XC2V4000-6) and Synplify Premier 8.5. The humps function is a
quotient of polynomials:

$$\text{humps} = \frac{0.0004x + 0.0002}{x^4 - 1.96x^3 + 1.348x^2 - 0.378x + 0.0373}.$$

Table 4 compares the FPGA implementation results of our NFGs and the NFGs
shown in [11]. For the humps function, the memory size of the NFG is not shown
in [11].
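
For readers who want to experiment with the humps benchmark, the quotient above can be evaluated as follows. This is a small software sketch only: the function name and the sample points are illustrative choices, and the input domain used in the experiments is not restated here.

```python
def humps(x: float) -> float:
    """Quotient of polynomials used as the humps benchmark in the text."""
    numerator = 0.0004 * x + 0.0002
    # Horner form of x^4 - 1.96x^3 + 1.348x^2 - 0.378x + 0.0373.
    denominator = (((x - 1.96) * x + 1.348) * x - 0.378) * x + 0.0373
    return numerator / denominator


# Sample the function at a few illustrative points.
for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"humps({x:.2f}) = {humps(x):.6f}")
```

The denominator becomes small near x = 0.27, which produces a narrow peak; a function of this shape is a natural test for segmentation methods that adapt segment widths.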

From these results, we can see that our NFGs using recursive segmentation and
the EV_SIE can realize a wide range of functions faster and more compactly
than existing NFGs.

5.4 Comparison of Design Methods

For segmentation, we have two methods: optimum non-uniform and recursive. For
the SIE, we also have two methods: MT_SIE and EV_SIE. Thus, there are four
different design methods. Table 5 compares these four design methods, which
are abbreviated as Non-uniform_MT, Recursive_MT, Non-uniform_EV, and
Recursive_EV.

Roughly speaking, the non-uniform segmentation produces a smaller coefficient
memory, but a larger and slower SIE. On the other hand, the recursive
segmentation produces a larger coefficient memory, but a smaller and faster
SIE. As for the SIE, the EV_SIE requires smaller LUT memories than the MT_SIE.
However, the EV_SIE requires a cascade of adders that often makes the NFG
slower than one with the MT_SIE.

Non-uniform_MT produces a small coefficient memory, but its SIE has the
largest LUT memory size among the four methods. Thus, Non-uniform_MT results
in NFGs with large memory size.

Recursive_MT produces a smaller and faster SIE than Non-uniform_MT. However,
the reduction of LUT memory size in the SIE is insufficient to compensate for
the increase in the coefficient memory. Thus, NFGs with Recursive_MT are
faster than those with Non-uniform_MT, but still require large memory size.

Non-uniform_EV produces a small coefficient memory and an SIE with smaller LUT
memories than Non-uniform_MT. However, the SIE is the slowest among the four
methods. Thus, NFGs with Non-uniform_EV require smaller memory size than those
with Non-uniform_MT, but they are the slowest among the four methods.

Recursive_EV produces the smallest and the fastest SIE among the four.
Moreover, the reduction of LUT memory size in the SIE is sufficient to
compensate for the increase in the coefficient memory. Thus, NFGs with
Recursive_EV are the smallest and the fastest among the four.

6. Concluding Remarks

We have presented design methods for numerical function generators using
recursive segmentation and EVBDDs. Our recursive segmentation is a hybrid of
an optimum non-uniform segmentation that produces the fewest segments and a
segmentation that reduces hardware complexity. Thus, our recursive
segmentation reduces the sizes of both the coefficient memory and the SIE. We
have proposed a new programmable architecture and its design method using
EVBDDs. We have shown that using both the new segmentation method and the new
architecture can produce faster and more compact programmable NFGs than the
existing NFGs. We have also shown that an adder can be replaced by a set of
2-input AND gates, thus reducing the delay. Experimental results show that:

1. recursive segmentation yields MT_SIEs that have only 49% of the LUT memory
   size and 55% of the delay of MT_SIEs based on optimum non-uniform
   segmentation,

2. by using EVBDDs to realize recursive segmentation, we reduce the LUT memory
   size and delay of the segment index encoders. On the average, this yields
   EV_SIEs that require only 8% of the LUT memory size and 36% of the delay of
   segment index encoders designed by optimum non-uniform segmentation, and

3. overall, our NFGs require, on the average, only 39% of the memory and 58%
   of the delay associated with NFGs based on MT_SIEs and optimum non-uniform
   segmentation.

These results are for a suite of seven functions, including e^x, sin(πx), and
√x.
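
The SIE averages in items 1 and 2 can be recomputed from Table 2. The sketch below rests on two assumptions that are not stated in the text: the zero entries for e^x and sin(πX) count as ratios of 0, and tan(πX) is left out of the delay averages because its MT_SIE delay is unavailable. The dictionary keys and the helper `mean_ratio` are illustrative names.

```python
# Table 2 entries as (LUT bits, delay in nsec.); None marks a "-" entry.
mt_optimum = {
    "e^x": (26_368, 58.8), "sin": (26_880, 54.9), "tan": (1_802_240, None),
    "arcsin": (61_440, 53.2), "sqrt": (61_440, 53.2),
    "sqrt_ln": (266_240, 56.7), "xlnx": (61_440, 52.5),
}
mt_recursive = {
    "e^x": (0, 0.0), "sin": (0, 0.0), "tan": (1_687_552, None),
    "arcsin": (49_152, 46.7), "sqrt": (44_544, 46.4),
    "sqrt_ln": (172_032, 54.3), "xlnx": (20_992, 30.8),
}
ev_recursive = {
    "e^x": (0, 0.0), "sin": (0, 0.0), "tan": (15_108, 43.6),
    "arcsin": (9_984, 28.0), "sqrt": (10_752, 27.2),
    "sqrt_ln": (12_736, 38.4), "xlnx": (8_960, 22.9),
}


def mean_ratio(new, base, field):
    """Unweighted mean of new/base for one field (0 = LUT bits, 1 = delay)."""
    ratios = []
    for name, base_entry in base.items():
        b, n = base_entry[field], new[name][field]
        if b is None or n is None:
            continue  # assumption: skip entries that could not be mapped
        ratios.append(n / b)
    return sum(ratios) / len(ratios)


for label, design in (("MT_SIE, recursive", mt_recursive),
                      ("EV_SIE, recursive", ev_recursive)):
    lut = 100 * mean_ratio(design, mt_optimum, 0)
    delay = 100 * mean_ratio(design, mt_optimum, 1)
    print(f"{label}: LUT {lut:.0f}%, delay {delay:.0f}% of MT_SIE, optimum")
```

Under these assumptions the averages round to 49% and 55% for the MT_SIE and to 8% and 36% for the EV_SIE.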

Appendix A: Proof of Theorem 1

Theorem 1: Let seg-func(X) be a segment index function with t segments. Then,
there exists an EV_SIE for seg-func(X) with at most $\lceil \log_2 t \rceil$
rails and $\lceil \log_2 t \rceil$ Arails.

Proof: The second part of the hypothesis concerning the number of Arails was
proven in [22]. Specifically, it was shown that

Theorem A: [22] Let seg-func(X) be a segment index function with t segments.
Then, there exists an LUT cascade for seg-func(X) with at most
$\lceil \log_2 t \rceil$ rails.

Each node in an MTBDD represents the Shannon expansion
$x_i \cdot f_1 + \bar{x}_i \cdot f_0$, where $f_0$ and $f_1$ are sub-functions
with respect to $x_i = 0$ and $x_i = 1$, respectively. On the other hand, each
node in an EVBDD represents the expansion

$$x_i (a + f_1') + \bar{x}_i f_0,$$

where $f_1 = a + f_1'$, and $a$ is a constant value of the sub-function $f_1$.
Note that, in an EVBDD, $a$ is the weight of a 1-edge. Let $\mu_e$ be the
number of distinct sub-functions produced by this expansion, and let $\mu_s$
be the number of distinct sub-functions produced by the Shannon expansion.
Then, we have

$$\mu_e \le \mu_s.$$

Theorem A shows that seg-func(X) can be represented by an MTBDD in which the
number of distinct sub-functions with respect to each $x_i$ is at most t.
Thus, there exists an EVBDD for seg-func(X) that has at most t distinct
sub-functions with respect to each $x_i$. In an EVBDD for seg-func(X), the sum
of weights of edges on a path is at most t, since each weight is a
non-negative integer. Therefore, we have Theorem 1. □
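
The inequality between the two sub-function counts can be observed on a small example. The following is a simplified sketch, not the construction used in the proof: it assumes a fixed variable order (most significant bit first), represents each sub-function by its truth-table block, and models the edge-valued expansion by subtracting each block's value at the all-zero assignment. The segment start points are hypothetical.

```python
from bisect import bisect_right


def segment_index_table(starts, n_bits):
    """Truth table where table[x] is the index of the segment containing x."""
    return [bisect_right(starts, x) - 1 for x in range(1 << n_bits)]


def count_subfunctions(table, n_bits, level):
    """Return (mu_s, mu_e) after fixing the `level` most significant variables."""
    size = 1 << (n_bits - level)
    blocks = {tuple(table[i:i + size]) for i in range(0, len(table), size)}
    # Shannon expansion: blocks that differ anywhere are distinct sub-functions.
    mu_s = len(blocks)
    # Edge-valued expansion: a constant offset moves onto an edge weight, so
    # blocks that differ only by a constant share one sub-function.
    mu_e = len({tuple(v - block[0] for v in block) for block in blocks})
    return mu_s, mu_e


starts = [0, 3, 4, 10]  # hypothetical segment start points, t = 4 segments
n_bits = 4
table = segment_index_table(starts, n_bits)
for level in range(n_bits + 1):
    mu_s, mu_e = count_subfunctions(table, n_bits, level)
    print(f"level {level}: mu_s = {mu_s}, mu_e = {mu_e}")
```

In this example every constant block collapses to the all-zero block under the edge-valued view, so the edge-valued count never exceeds the Shannon count at any level.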